POINTBISERIALR
Overview
The POINTBISERIALR function calculates the point-biserial correlation coefficient, a measure of the strength and direction of association between a binary variable (coded as 0 and 1) and a continuous variable. This statistic is commonly used in psychometrics, educational testing, and social science research to assess relationships such as whether a treatment group (1) versus control group (0) differs on a continuous outcome measure.
The point-biserial correlation is mathematically equivalent to the Pearson correlation coefficient computed on a dichotomous variable and a continuous variable. Like other correlation coefficients, it ranges from -1 to +1, where 0 indicates no linear association and -1 or +1 indicates a perfect linear relationship, which for a binary variable means every observation within a group shares the same value of the continuous variable.
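The equivalence with the Pearson coefficient can be checked numerically. The sketch below is a minimal illustration that reuses the data from Example 1 and calls `scipy.stats.pearsonr` on the same inputs; the variable names are chosen for illustration only.

```python
from scipy.stats import pearsonr, pointbiserialr

# Data from Example 1: binary group labels and a continuous score.
group = [0, 0, 0, 1, 1, 1, 1]
score = [1, 2, 3, 4, 5, 6, 7]

r_pb, p_pb = pointbiserialr(group, score)
r_pearson, p_pearson = pearsonr(group, score)

# Both calls should agree up to floating-point rounding.
print(f"point-biserial: r={r_pb:.4f}, p={p_pb:.4f}")
print(f"Pearson:        r={r_pearson:.4f}, p={p_pearson:.4f}")
```

If the two lines printed differ beyond rounding, the binary variable is most likely not coded as exactly two distinct values.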
This implementation uses SciPy’s pointbiserialr function from the scipy.stats module. The function returns both the correlation coefficient and a two-sided p-value based on a t-test with n-2 degrees of freedom.
The point-biserial correlation coefficient is calculated using the formula:
r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_y} \sqrt{\frac{N_0 N_1}{N(N-1)}}
where \bar{Y}_0 and \bar{Y}_1 are the means of the continuous variable for observations coded 0 and 1 respectively, N_0 and N_1 are the counts of observations in each group, N = N_0 + N_1 is the total sample size, and s_y is the sample standard deviation of the continuous variable over all N observations (computed with an N-1 denominator).
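As a rough check of the formula, the following sketch evaluates each term with NumPy for the data from Example 1 and compares the result with SciPy. It assumes s_y is the sample standard deviation (`ddof=1`), which is what the N(N-1) factor requires; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import pointbiserialr

x = np.array([0, 0, 0, 1, 1, 1, 1])
y = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)

# Group sizes and group means of the continuous variable.
n0 = int(np.sum(x == 0))
n1 = int(np.sum(x == 1))
n = n0 + n1
mean0 = y[x == 0].mean()
mean1 = y[x == 1].mean()

# Assumption: s_y is the sample standard deviation (N - 1 denominator).
s_y = y.std(ddof=1)

r_manual = (mean1 - mean0) / s_y * np.sqrt(n0 * n1 / (n * (n - 1)))
r_scipy = pointbiserialr(x, y).statistic

print(f"manual: {r_manual:.4f}, scipy: {r_scipy:.4f}")
```

Using the population standard deviation (`ddof=0`) instead would require replacing N(N-1) with N^2 under the square root to obtain the same value.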
A significant point-biserial correlation (p-value below a chosen threshold such as 0.05) is equivalent to finding a significant difference in means between the two groups via an independent-samples t-test with pooled (equal) variances. The t-statistic is related to r_{pb} by:
t = \sqrt{N-2} \cdot \frac{r_{pb}}{\sqrt{1 - r_{pb}^2}}
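The relationship can be illustrated with the data from Example 3. This sketch converts r_{pb} to a t-statistic, derives a two-sided p-value from the t distribution with N-2 degrees of freedom, and compares both with `scipy.stats.ttest_ind` run with `equal_var=True`. It is a minimal check under those assumptions, not a description of SciPy's internals.

```python
import math
from scipy.stats import pointbiserialr, t as t_dist, ttest_ind

# Data from Example 3.
x = [0, 0, 0, 1, 1, 1]
y = [10, 8, 9, 2, 3, 1]
n = len(x)

r_pb, p_pb = pointbiserialr(x, y)

# t-statistic implied by the correlation, and its two-sided p-value.
t_from_r = math.sqrt(n - 2) * r_pb / math.sqrt(1 - r_pb**2)
p_from_t = 2 * t_dist.sf(abs(t_from_r), df=n - 2)

# Pooled-variance t-test on the two groups (group coded 1 listed first,
# so the sign matches the sign of r_pb).
group0 = [yi for xi, yi in zip(x, y) if xi == 0]
group1 = [yi for xi, yi in zip(x, y) if xi == 1]
t_test = ttest_ind(group1, group0, equal_var=True)

print(f"t from r: {t_from_r:.4f}, t-test: {t_test.statistic:.4f}")
print(f"p from t: {p_from_t:.6f}, t-test: {t_test.pvalue:.6f}, pointbiserialr: {p_pb:.6f}")
```

The formula is undefined when r_{pb} is exactly -1 or +1, because the denominator becomes zero; the sketch does not handle that case.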
For additional background on the point-biserial correlation, see Tate (1954) and the Wiley StatsRef entry on Point Biserial Correlation.
This example function is provided as-is without any representation of accuracy.
Excel Usage
=POINTBISERIALR(x, y)
x (list[list], required): Binary variable (column vector of 0s and 1s)
y (list[list], required): Continuous variable (column vector), same length as x
Returns (list[list]): 2D list [[correlation, p_value]], or error message string.
Examples
Example 1: Strong positive correlation
Inputs:
| x | y |
|---|---|
| 0 | 1 |
| 0 | 2 |
| 0 | 3 |
| 1 | 4 |
| 1 | 5 |
| 1 | 6 |
| 1 | 7 |
Excel formula:
=POINTBISERIALR({0;0;0;1;1;1;1}, {1;2;3;4;5;6;7})
Expected output:
| Correlation | p-value |
|---|---|
| 0.866 | 0.0117 |
Example 2: Perfect positive correlation
Inputs:
| x | y |
|---|---|
| 0 | 1 |
| 0 | 1 |
| 1 | 5 |
| 1 | 5 |
Excel formula:
=POINTBISERIALR({0;0;1;1}, {1;1;5;5})
Expected output:
| Correlation | p-value |
|---|---|
| 1 | 0 |
Example 3: Strong negative correlation
Inputs:
| x | y |
|---|---|
| 0 | 10 |
| 0 | 8 |
| 0 | 9 |
| 1 | 2 |
| 1 | 3 |
| 1 | 1 |
Excel formula:
=POINTBISERIALR({0;0;0;1;1;1}, {10;8;9;2;3;1})
Expected output:
| Correlation | p-value |
|---|---|
| -0.9739 | 0.001 |
Example 4: No correlation (equal group means)
Inputs:
| x | y |
|---|---|
| 0 | 1 |
| 0 | 5 |
| 1 | 3 |
| 1 | 3 |
Excel formula:
=POINTBISERIALR({0;0;1;1}, {1;5;3;3})
Expected output:
| Correlation | p-value |
|---|---|
| 0 | 1 |
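The four expected outputs can be reproduced outside Excel by calling SciPy directly on the flattened inputs. The sketch below is one way to do that; the `examples` and `results` names are illustrative, and the printed values are rounded to four decimals for comparison with the tables.

```python
from scipy.stats import pointbiserialr

# Inputs copied from Examples 1 to 4, flattened to 1D lists.
examples = {
    "Example 1": ([0, 0, 0, 1, 1, 1, 1], [1, 2, 3, 4, 5, 6, 7]),
    "Example 2": ([0, 0, 1, 1], [1, 1, 5, 5]),
    "Example 3": ([0, 0, 0, 1, 1, 1], [10, 8, 9, 2, 3, 1]),
    "Example 4": ([0, 0, 1, 1], [1, 5, 3, 3]),
}

results = {}
for name, (x, y) in examples.items():
    res = pointbiserialr(x, y)
    results[name] = (float(res.statistic), float(res.pvalue))
    print(f"{name}: correlation={results[name][0]:.4f}, p-value={results[name][1]:.4f}")
```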
Python Code
from scipy.stats import pointbiserialr as scipy_pointbiserialr


def pointbiserialr(x, y):
    """
    Calculate a point biserial correlation coefficient and its p-value.

    See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pointbiserialr.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        x (list[list]): Binary variable (column vector of 0s and 1s)
        y (list[list]): Continuous variable (column vector), same length as x

    Returns:
        list[list]: 2D list [[correlation, p_value]], or error message string.
    """
    # Helper function to wrap a scalar input as a 2D list; lists pass through
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    # Normalize inputs
    x = to2d(x)
    y = to2d(y)

    # Flatten 2D lists to 1D and convert to numeric
    try:
        x_flat = []
        for row in x:
            if isinstance(row, list):
                x_flat.extend(row)
            else:
                x_flat.append(row)

        y_flat = []
        for row in y:
            if isinstance(row, list):
                y_flat.extend(row)
            else:
                y_flat.append(row)

        x_array = [float(val) for val in x_flat]
        y_array = [float(val) for val in y_flat]
    except (ValueError, TypeError) as e:
        return f"Invalid input: x and y must contain numeric values. {str(e)}"

    # Check that arrays have the same length
    if len(x_array) != len(y_array):
        return "Invalid input: x and y must have the same length."

    # Check minimum length
    if len(x_array) < 3:
        return "Invalid input: arrays must contain at least 3 elements."

    # Validate that x contains only binary values (0 or 1)
    x_unique = set(x_array)
    if not x_unique.issubset({0.0, 1.0}):
        return "Invalid input: x must contain only binary values (0 or 1)."

    # Check that we have both 0 and 1 values in x
    if len(x_unique) < 2:
        return "Invalid input: x must contain both 0 and 1 values."

    # Check for constant y values
    if len(set(y_array)) == 1:
        return "Invalid input: y must contain varying values (not all identical)."

    try:
        # Calculate point-biserial correlation
        result = scipy_pointbiserialr(x_array, y_array)
        correlation = float(result.statistic)
        pvalue = float(result.pvalue)

        # Return as 2D list (single row, two columns)
        return [[correlation, pvalue]]
    except Exception as e:
        return f"Error calculating point-biserial correlation: {str(e)}"